Linear regression
Daniel Hsu (COMS 4771)

Maximum likelihood estimation

One of the simplest linear regression models is the following: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$, and
\[
  Y \mid X = x \;\sim\; \mathrm{N}(x^{\top}\beta, \sigma^2), \quad x \in \mathbb{R}^d.
\]
Here, the vector $\beta \in \mathbb{R}^d$ and scalar $\sigma^2 > 0$ are the parameters of the model. (The marginal distribution of $X$ is unspecified.) The log-likelihood of $(\beta, \sigma^2)$ given $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$ is
\[
  \sum_{i=1}^{n} \left\{ \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y_i - x_i^{\top}\beta)^2}{2\sigma^2} \right\} + T,
\]
where $T$ is some quantity that does not depend on $(\beta, \sigma^2)$. Therefore, maximizing the log-likelihood over $\beta \in \mathbb{R}^d$ (for any $\sigma^2 > 0$) is the same as minimizing
\[
  \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2.
\]
So, the maximum likelihood estimator (MLE) of $\beta$ in this model is
\[
  \hat\beta \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2.
\]
(It is not necessarily uniquely determined.)

Empirical risk minimization

Let $P_n$ be the empirical distribution on $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$, i.e., the probability distribution over $\mathbb{R}^d \times \mathbb{R}$ with probability mass function $p_n$ given by
\[
  p_n((x, y)) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{(x, y) = (x_i, y_i)\}, \quad (x, y) \in \mathbb{R}^d \times \mathbb{R}.
\]
The distribution assigns probability mass $1/n$ to each $(x_i, y_i)$ for $i = 1, \ldots, n$; no mass is assigned anywhere else. Now consider $(\tilde{X}, \tilde{Y}) \sim P_n$. The expected squared loss of the linear function $\beta \in \mathbb{R}^d$ on $(\tilde{X}, \tilde{Y})$ is
\[
  \widehat{R}(\beta) := \mathbb{E}[(\tilde{X}^{\top}\beta - \tilde{Y})^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2;
\]
we call this the empirical risk of $\beta$ on the data $(x_1, y_1), \ldots, (x_n, y_n)$.
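As a quick numerical illustration (my own sketch, not part of the original notes; the synthetic data, parameter values, and variable names are made up), the following NumPy code fits $\hat\beta$ by least squares and checks that, for a fixed $\sigma^2$, any other $\beta$ has larger empirical risk and therefore smaller Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5
beta_true = np.array([1.0, -2.0, 0.5])          # made-up parameter for the simulation

X = rng.normal(size=(n, d))                     # rows are x_i^T
y = X @ beta_true + sigma * rng.normal(size=n)  # Y_i | X_i = x_i ~ N(x_i^T beta, sigma^2)

def empirical_risk(beta):
    """(1/n) * sum_i (x_i^T beta - y_i)^2."""
    return np.mean((X @ beta - y) ** 2)

def log_likelihood(beta, sigma2):
    """Gaussian log-likelihood of (beta, sigma2) given the (x_i, y_i)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - X @ beta) ** 2 / (2 * sigma2))

# Least-squares fit: a minimizer of the empirical risk (unique here since rank(X) = d).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any other beta has larger empirical risk, hence smaller log-likelihood (for fixed sigma^2).
beta_other = beta_hat + rng.normal(scale=0.1, size=d)
assert empirical_risk(beta_hat) <= empirical_risk(beta_other)
assert log_likelihood(beta_hat, sigma**2) >= log_likelihood(beta_other, sigma**2)
```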

Empirical risk minimization is the method of choosing a function (from some class of functions) based on data by choosing a minimizer of the empirical risk on the data. In the case of linear functions, the empirical risk minimizer (ERM) is
\[
  \hat\beta \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} \widehat{R}(\beta) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2.
\]
This is the same as the MLE from above. (It is not necessarily uniquely determined.)

Normal equations

Let
\[
  A := \frac{1}{\sqrt{n}} \begin{bmatrix} x_1^{\top} \\ \vdots \\ x_n^{\top} \end{bmatrix} \in \mathbb{R}^{n \times d},
  \qquad
  b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^{n}.
\]
We can write the empirical risk as
\[
  \widehat{R}(\beta) = \|A\beta - b\|_2^2, \quad \beta \in \mathbb{R}^d.
\]
The gradient of $\widehat{R}$ is given by
\[
  \nabla \widehat{R}(\beta) = \nabla\{(A\beta - b)^{\top}(A\beta - b)\} = 2A^{\top}(A\beta - b), \quad \beta \in \mathbb{R}^d;
\]
it is equal to zero for $\beta \in \mathbb{R}^d$ satisfying
\[
  A^{\top}A\beta = A^{\top}b.
\]
These linear equations in $\beta$, which define the critical points of $\widehat{R}$, are collectively called the normal equations. It turns out the normal equations in fact determine the minimizers of $\widehat{R}$. To see this, let $\hat\beta$ be any solution to the normal equations. Now consider any other $\beta \in \mathbb{R}^d$. We write the empirical risk of $\beta$ as follows:
\begin{align*}
  \widehat{R}(\beta)
  &= \|A\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta) + A\hat\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta)\|_2^2 + 2\,(A(\beta - \hat\beta))^{\top}(A\hat\beta - b) + \|A\hat\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta)\|_2^2 + 2\,(\beta - \hat\beta)^{\top}(A^{\top}A\hat\beta - A^{\top}b) + \|A\hat\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta)\|_2^2 + \|A\hat\beta - b\|_2^2 \\
  &\geq \widehat{R}(\hat\beta).
\end{align*}
The second-to-last step above uses the fact that $\hat\beta$ is a solution to the normal equations. Therefore, we conclude that $\widehat{R}(\beta) \geq \widehat{R}(\hat\beta)$ for all $\beta \in \mathbb{R}^d$ and all solutions $\hat\beta$ to the normal equations. So the solutions to the normal equations are the minimizers of $\widehat{R}$.
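A minimal sketch (again mine, on synthetic data) of the normal equations in action, assuming the $1/\sqrt{n}$ scaling of $A$ and $b$ used above: solve $A^{\top}A\beta = A^{\top}b$ directly, check that the gradient of $\widehat{R}$ vanishes at the solution, and check that perturbed coefficient vectors have larger empirical risk.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
x = rng.normal(size=(n, d))                                  # data points x_1, ..., x_n (as rows)
y = x @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=n)

A = x / np.sqrt(n)                                           # rows x_i^T / sqrt(n)
b = y / np.sqrt(n)                                           # entries y_i / sqrt(n)

def emp_risk(beta):
    return np.sum((A @ beta - b) ** 2)                       # R_hat(beta) = ||A beta - b||_2^2

# Solve the normal equations A^T A beta = A^T b (A^T A is invertible since rank(A) = d here).
beta_hat = np.linalg.solve(A.T @ A, A.T @ b)

grad = 2 * A.T @ (A @ beta_hat - b)                          # gradient of R_hat at beta_hat
assert np.allclose(grad, 0.0, atol=1e-10)

# Any other beta has at least as large an empirical risk.
for _ in range(5):
    assert emp_risk(beta_hat + rng.normal(size=d)) >= emp_risk(beta_hat)
```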

Statistical interpretation

Suppose $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$. The risk of a linear function $\beta \in \mathbb{R}^d$ is
\[
  R(\beta) := \mathbb{E}[(X^{\top}\beta - Y)^2].
\]
Which linear functions have smallest risk? The gradient of $R$ is given by
\[
  \nabla R(\beta) = \mathbb{E}\left[\nabla\{(X^{\top}\beta - Y)^2\}\right] = 2\,\mathbb{E}[X(X^{\top}\beta - Y)], \quad \beta \in \mathbb{R}^d;
\]
it is equal to zero for $\beta \in \mathbb{R}^d$ satisfying
\[
  \mathbb{E}[XX^{\top}]\beta = \mathbb{E}[YX].
\]
These linear equations in $\beta$, which define the critical points of $R$, are collectively called the population normal equations. It turns out the population normal equations in fact determine the minimizers of $R$. To see this, let $\beta^*$ be any solution to the population normal equations. Now consider any other $\beta \in \mathbb{R}^d$. We write the risk of $\beta$ as follows:
\begin{align*}
  R(\beta)
  &= \mathbb{E}[(X^{\top}\beta - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*) + X^{\top}\beta^* - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*))^2 + 2(X^{\top}(\beta - \beta^*))(X^{\top}\beta^* - Y) + (X^{\top}\beta^* - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*))^2] + 2\,(\beta - \beta^*)^{\top}\left(\mathbb{E}[XX^{\top}]\beta^* - \mathbb{E}[YX]\right) + \mathbb{E}[(X^{\top}\beta^* - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*))^2] + \mathbb{E}[(X^{\top}\beta^* - Y)^2] \\
  &\geq R(\beta^*).
\end{align*}
The second-to-last step above uses the fact that $\beta^*$ is a solution to the population normal equations. Therefore, we conclude that $R(\beta) \geq R(\beta^*)$ for all $\beta \in \mathbb{R}^d$ and all solutions $\beta^*$ to the population normal equations. So the solutions to the population normal equations are the minimizers of $R$.

The similarity to the previous section is no accident. The normal equations (based on $(X_1, Y_1), \ldots, (X_n, Y_n)$) are precisely
\[
  \mathbb{E}[\tilde{X}\tilde{X}^{\top}]\beta = \mathbb{E}[\tilde{Y}\tilde{X}] \quad \text{for } (\tilde{X}, \tilde{Y}) \sim P_n,
\]
where $P_n$ is the empirical distribution on $(X_1, Y_1), \ldots, (X_n, Y_n)$. By the Law of Large Numbers, the left-hand side $\mathbb{E}[\tilde{X}\tilde{X}^{\top}]$ converges to $\mathbb{E}[XX^{\top}]$ and the right-hand side $\mathbb{E}[\tilde{Y}\tilde{X}]$ converges to $\mathbb{E}[YX]$ as $n \to \infty$. In other words, the normal equations converge to the population normal equations as $n \to \infty$. Thus, ERM can be regarded as a plug-in estimator for $\beta^*$. Using classical arguments from asymptotic statistics, one can prove that the distribution of $\sqrt{n}(\hat\beta - \beta^*)$ converges (as $n \to \infty$) to a multivariate normal with mean zero and covariance
\[
  \mathbb{E}[XX^{\top}]^{-1} \operatorname{cov}(\varepsilon X)\, \mathbb{E}[XX^{\top}]^{-1},
  \quad \text{where } \varepsilon := Y - X^{\top}\beta^*.
\]
(This assumes, along with some standard moment conditions, that $\mathbb{E}[XX^{\top}]$ is invertible so that $\beta^*$ is uniquely defined. But it does not require the conditional distribution of $Y \mid X$ to be normal.)

Geometric interpretation

Let $a_j \in \mathbb{R}^n$ be the vector in the $j$-th column of $A$, so $A = [a_1 \mid \cdots \mid a_d]$. Since $\operatorname{range}(A) = \{A\beta : \beta \in \mathbb{R}^d\}$, minimizing $\|A\beta - b\|_2^2$ is the same as finding the vector $\hat{b} \in \operatorname{range}(A)$ closest to $b$ (in Euclidean distance), and then specifying the linear combination of $a_1, \ldots, a_d$ that is equal to $\hat{b}$, i.e., specifying $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_d)$ such that $\hat\beta_1 a_1 + \cdots + \hat\beta_d a_d = \hat{b}$. The solution $\hat{b}$ is the orthogonal projection of $b$ onto $\operatorname{range}(A)$. This vector $\hat{b}$ is uniquely determined; however, the coefficients $\hat\beta$ are uniquely determined if and only if $a_1, \ldots, a_d$ are linearly independent. The vectors $a_1, \ldots, a_d$ are linearly independent exactly when the rank of $A$ is equal to $d$. We conclude that the empirical risk has a unique minimizer exactly when $A$ has rank $d$.
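The geometric picture can be checked numerically; the sketch below (my own, with an arbitrary random $A$ of full column rank) verifies that the residual $b - A\hat\beta$ is orthogonal to every column of $A$ and that $A\hat\beta = \Pi b$ for the projection matrix $\Pi = A(A^{\top}A)^{-1}A^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
A = rng.normal(size=(n, d))       # generic A, so rank(A) = d with probability 1
b = rng.normal(size=n)

beta_hat = np.linalg.solve(A.T @ A, A.T @ b)   # unique solution of the normal equations
b_hat = A @ beta_hat                           # should be the projection of b onto range(A)

# Residual is orthogonal to range(A), i.e., to every column a_j of A.
assert np.allclose(A.T @ (b - b_hat), 0.0)

# Same vector via the orthogonal projection matrix Pi = A (A^T A)^{-1} A^T.
Pi = A @ np.linalg.solve(A.T @ A, A.T)
assert np.allclose(Pi @ b, b_hat)
assert np.allclose(Pi @ Pi, Pi)                # Pi is idempotent, as a projection should be
```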

Fixed design analysis

It is somewhat mathematically easier to study linear regression in the fixed design setting than it is in the usual setting of machine learning. In the fixed design setting, we are given the following.

1. A design matrix $A := \frac{1}{\sqrt{n}}[x_1 \mid \cdots \mid x_n]^{\top} \in \mathbb{R}^{n \times d}$, which is not random. (The vectors $x_i/\sqrt{n} \in \mathbb{R}^d$ are the columns of $A^{\top}$.) For simplicity, we'll assume that $\operatorname{rank}(A) = d$.

2. A random response vector $b = (b_1, \ldots, b_n) := (Y_1, \ldots, Y_n)/\sqrt{n}$, where $Y_1, \ldots, Y_n$ are uncorrelated (i.e., $\operatorname{cov}(Y_i, Y_j) = 0$ for $i \neq j$) real-valued random variables. Let $\mu := (\mu_1, \ldots, \mu_n)$, where $\mu_i := \mathbb{E}(Y_i/\sqrt{n})$. For simplicity, we'll assume $\operatorname{var}(Y_i) = \sigma^2$.

The goal is to find a linear function $\beta \in \mathbb{R}^d$ such that the (fixed design) risk
\[
  R(\beta) := \frac{1}{n}\sum_{i=1}^{n} (x_i^{\top}\beta - \mathbb{E}(Y_i))^2 = \|A\beta - \mu\|_2^2
\]
is as small as possible.

Fixed design risk minimizers and ordinary least squares

The minimizers of $R$ are the vectors $\beta \in \mathbb{R}^d$ that satisfy the following system of linear equations:
\[
  A^{\top}A\beta = A^{\top}\mu.
\]
But note that this system of linear equations depends on $\mu$, which is unknown. Instead, we only have the (random) vector of responses $b$. What we can do is to find a vector $\hat\beta \in \mathbb{R}^d$ that satisfies
\[
  A^{\top}A\hat\beta = A^{\top}b,
\]
which is the same as the previous system of linear equations, except $\mu$ is replaced by $b$. This approach is called ordinary least squares: $\hat\beta$ is chosen to be a minimizer of
\[
  \widehat{R}(\beta) := \frac{1}{n}\sum_{i=1}^{n} (x_i^{\top}\beta - Y_i)^2 = \|A\beta - b\|_2^2.
\]
Since we assume $\operatorname{rank}(A) = d$, it follows that $A^{\top}A$ is invertible. Hence the risk minimizer $\beta^*$ and the OLS estimator $\hat\beta$ are both uniquely determined by the following formulae:
\[
  \beta^* = (A^{\top}A)^{-1}A^{\top}\mu, \qquad \hat\beta = (A^{\top}A)^{-1}A^{\top}b.
\]
Moreover, we see that $\mathbb{E}[A\hat\beta] = A\beta^*$ by linearity of expectation and the fact $\mathbb{E}[b] = \mu$.
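A small simulation (my own construction; the design and mean vector are arbitrary choices) of the fixed design setting: holding $A$ fixed and redrawing the responses many times, the Monte Carlo average of $A\hat\beta$ is close to $A\beta^*$, illustrating $\mathbb{E}[A\hat\beta] = A\beta^*$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 100, 3, 1.0
x = rng.normal(size=(n, d))                      # design points, drawn once and then held fixed
A = x / np.sqrt(n)
mean_Y = x @ np.array([2.0, 0.0, -1.0])          # E[Y_i]; any fixed mean vector would do
mu = mean_Y / np.sqrt(n)

AtA_inv_At = np.linalg.solve(A.T @ A, A.T)       # (A^T A)^{-1} A^T, valid since rank(A) = d
beta_star = AtA_inv_At @ mu                      # minimizer of the fixed-design risk

fits = []
for _ in range(2000):
    Y = mean_Y + sigma * rng.normal(size=n)      # uncorrelated responses with variance sigma^2
    beta_hat = AtA_inv_At @ (Y / np.sqrt(n))     # ordinary least squares estimate
    fits.append(A @ beta_hat)

# The Monte Carlo average of A beta_hat is close to A beta_star (up to simulation error).
print(np.abs(np.mean(fits, axis=0) - A @ beta_star).max())
```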

Risk of ordinary least squares

What is the (fixed design) risk of $\hat\beta$? One simplification comes from using the definition of $\beta^*$:
\begin{align*}
  R(\hat\beta)
  &= \|A\hat\beta - \mu\|_2^2 \\
  &= \|A(\hat\beta - \beta^*) + A\beta^* - \mu\|_2^2 && \text{(adding and subtracting $A\beta^*$)} \\
  &= \|A(\hat\beta - \beta^*)\|_2^2 + 2\,(\hat\beta - \beta^*)^{\top}A^{\top}(A\beta^* - \mu) + \|A\beta^* - \mu\|_2^2 && \text{(expanding the square)} \\
  &= \|A(\hat\beta - \beta^*)\|_2^2 + R(\beta^*). && \text{(using the fact $A^{\top}A\beta^* = A^{\top}\mu$)}
\end{align*}
So the difference between the risk of $\hat\beta$ and the (optimal) risk of $\beta^*$ is precisely $\|A(\hat\beta - \beta^*)\|_2^2$. Note that $R(\hat\beta)$ is a random variable because $\hat\beta$ is a random vector (depending on $b$). The expected value of $R(\hat\beta) - R(\beta^*)$ is
\begin{align*}
  \mathbb{E}[R(\hat\beta) - R(\beta^*)]
  &= \mathbb{E}\,\|A(\hat\beta - \beta^*)\|_2^2 \\
  &= \mathbb{E}\,\|A(A^{\top}A)^{-1}A^{\top}(b - \mu)\|_2^2 && \text{(using the formulae for $\hat\beta$ and $\beta^*$)} \\
  &= \mathbb{E}\,\|\Pi(b - \mu)\|_2^2,
\end{align*}
where $\Pi := A(A^{\top}A)^{-1}A^{\top}$ is the orthogonal projection operator for the range of $A$. Expanding
\[
  \|\Pi(b - \mu)\|_2^2 = (b - \mu)^{\top}\Pi^{\top}\Pi(b - \mu) = (b - \mu)^{\top}\Pi(b - \mu)
\]
and taking expectations, we obtain
\begin{align*}
  \mathbb{E}\,\|\Pi(b - \mu)\|_2^2
  &= \sum_{i=1}^{n}\sum_{j=1}^{n} \Pi_{i,j}\, \mathbb{E}[(b_i - \mu_i)(b_j - \mu_j)] \\
  &= \sum_{i=1}^{n}\sum_{j=1}^{n} \Pi_{i,j} \operatorname{cov}(b_i, b_j) \\
  &= \frac{\sigma^2}{n} \sum_{i=1}^{n} \Pi_{i,i},
\end{align*}
where the last step uses the fact that
\[
  \operatorname{cov}(b_i, b_j) = \frac{\operatorname{cov}(Y_i, Y_j)}{n} =
  \begin{cases}
    \sigma^2/n & \text{if } i = j, \\
    0 & \text{if } i \neq j.
  \end{cases}
\]
The sum of the diagonal entries of $\Pi$ is the trace of $\Pi$, written $\operatorname{tr}(\Pi)$. The trace of a symmetric matrix is equal to the sum of its eigenvalues. Since an orthogonal projection matrix has eigenvalues either 0 or 1, and the number of eigenvalues equal to one is exactly its rank, it follows that $\operatorname{tr}(\Pi) = \operatorname{rank}(\Pi) = \operatorname{rank}(A) = d$. We have shown that
\[
  \mathbb{E}[R(\hat\beta)] = R(\beta^*) + \frac{\sigma^2 d}{n}.
\]

Dropping the simplifying assumptions

Suppose we drop the assumption that $\operatorname{var}(Y_i) = \sigma^2$ for all $i$. Then the same arguments from above can be used to prove
\[
  \mathbb{E}[R(\hat\beta)] \leq R(\beta^*) + \frac{\sigma^2 d}{n},
\]
where $\sigma^2 := \max_i \operatorname{var}(Y_i)$. Now suppose we drop the assumption that $\operatorname{rank}(A) = d$. Let $r$ denote the rank of $A$, and let $U \in \mathbb{R}^{n \times r}$ be any matrix whose columns span the range of $A$. Then any vector of the form $A\beta$ can be written as $U\alpha$ for some $\alpha \in \mathbb{R}^r$. Then the same arguments from above can be applied to the fixed design linear regression problem with $U \in \mathbb{R}^{n \times r}$ in place of $A \in \mathbb{R}^{n \times d}$, leading to the expected risk bound
\[
  \mathbb{E}[R(\hat\beta)] = R(\beta^*) + \frac{\sigma^2 r}{n}
\]
if $\operatorname{var}(Y_i) = \sigma^2$ for all $i$, and
\[
  \mathbb{E}[R(\hat\beta)] \leq R(\beta^*) + \frac{\sigma^2 r}{n}
\]
for $\sigma^2 := \max_i \operatorname{var}(Y_i)$.
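To close, a Monte Carlo sanity check (my own sketch, not part of the notes; for convenience the mean vector is chosen to lie in $\operatorname{range}(A)$, so $R(\beta^*) = 0$) of the expected-risk formula $\mathbb{E}[R(\hat\beta)] = R(\beta^*) + \sigma^2 d / n$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 100, 5, 2.0
x = rng.normal(size=(n, d))                 # fixed design, rank d with probability 1
A = x / np.sqrt(n)
mu = (x @ rng.normal(size=d)) / np.sqrt(n)  # scaled mean vector, mu_i = E[Y_i]/sqrt(n)

AtA_inv_At = np.linalg.solve(A.T @ A, A.T)
beta_star = AtA_inv_At @ mu
risk_star = np.sum((A @ beta_star - mu) ** 2)   # R(beta_star); zero here since mu lies in range(A)

excess = []
for _ in range(5000):
    b = mu + (sigma / np.sqrt(n)) * rng.normal(size=n)   # b_i = Y_i / sqrt(n), var(Y_i) = sigma^2
    beta_hat = AtA_inv_At @ b
    excess.append(np.sum((A @ beta_hat - mu) ** 2) - risk_star)

print(np.mean(excess))        # Monte Carlo estimate of E[R(beta_hat)] - R(beta_star)
print(sigma**2 * d / n)       # theory: sigma^2 d / n = 0.2
```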